Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Sequencing and Raw Sequence Data Quality Control ◾ 13

for good quality in the high-throughput sequencing (HTS). Table 1.2 shows some of the Q

scores and corresponding error probability, base call accuracy, and interpretation.

1.4 FASTQ FILES

Sequencing platforms such as Illumina are supplied with Real-Time Analysis (RTA)

software that stores individual base call data in intermediate files called BCL files. When

the sequencing run completes, these BCL files are filtered, demultiplexed if the samples are

multiplexed, and then converted into a sequence file format called FASTQ. There will be a

single FASTQ file for each sample for a single-end run and two FASTQ files (R1 and R2)

for each sample for a paired-end run: R1 file for forward reads and R2 for reverse reads.

FASTQ files are usually compressed and typically have the file extension “.fastq.gz”.
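A quick sanity check on a paired-end run is to confirm that the R1 and R2 files of a sample hold the same number of reads. The Python sketch below illustrates one way to do this, assuming the usual layout of four lines per record and gzip compression for files ending in ".gz". The function names open_fastq, count_reads, and pairs_match are hypothetical and are not part of any Illumina tool.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text; files ending in .gz are read through gzip."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Count reads, assuming every record spans exactly four lines."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def pairs_match(r1_path, r2_path):
    """Return True if the R1 and R2 files hold the same number of reads."""
    return count_reads(r1_path) == count_reads(r2_path)
```

A mismatch between the two counts usually indicates a truncated or incompletely transferred file.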

A FASTQ [7] file is a human-readable file format that has become the de facto standard for

storing the output of most HTS technologies. A FASTQ file consists of a number of records,

with each record having four lines of data as shown in Figure 1.6.
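The four-line record structure can be read with a few lines of Python. The following is a minimal sketch, not a replacement for an established parser: it assumes that records are not wrapped across extra lines, and the names FastqRecord and read_fastq are hypothetical.

```python
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2, the base calls
    plus: str        # line 3 without the leading "+"
    quality: str     # line 4, the encoded quality scores


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one FastqRecord for every four lines read from handle."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        header = header.rstrip("\r\n")
        if not header:            # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at: {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)
```

Because read_fastq is a generator, it holds only one record in memory at a time, which matters for files with millions of reads.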

The first line of each record of a FASTQ file begins with the “@” symbol and this line is

called the read identifier since it identifies the sequence (read). A typical FASTQ identifier

line of the reads generated by an Illumina instrument looks as follows:

@<instrument>:<run num>:<flowcell ID>:<lane>:<tile>:<x>:<y>:<UMI>

Table 1.3 describes the elements of the Illumina FASTQ identifier line and Figure 1.6

shows an example FASTQ file with three read records. The observed index sequence (part

of the adaptor) is written to the FASTQ header in place of the sample number.

This information can be useful for troubleshooting and demultiplexing. However,

these metadata elements may be altered or replaced by other elements, especially when the

files are submitted to a database or edited by users.
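The colon-separated identifier layout shown above can be split into named fields. The sketch below assumes exactly that layout, with the UMI field treated as optional, and it ignores any text that follows the first space on the line. The names IlluminaId and parse_illumina_id are hypothetical, and the identifier in the usage example is made up for illustration. Since headers may have been rewritten, the function raises an error rather than guessing when the layout does not match.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    run_number: str
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    umi: str  # empty string when the identifier carries no UMI field


def parse_illumina_id(line: str) -> IlluminaId:
    """Split an Illumina-style identifier line into its elements."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # Keep only the part before the first space (assumption of this sketch).
    name = line[1:].strip().split(" ", 1)[0]
    fields = name.split(":")
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7 or 8 colon-separated fields, got {len(fields)}")
    instrument, run_number, flowcell_id = fields[:3]
    lane, tile, x, y = (int(value) for value in fields[3:7])
    umi = fields[7] if len(fields) == 8 else ""
    return IlluminaId(instrument, run_number, flowcell_id, lane, tile, x, y, umi)


# Usage with a made-up identifier:
# parse_illumina_id("@SIM:1:FCX:1:15:6329:1045:GATTACT").tile  gives 15
```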

The second line of each record contains the bases inferred by the sequencer. The

bases are A, C, G, and T for adenine, cytosine, guanine, and thymine, respectively.

The character N may be included if the base in a position is ambiguous (it was not

determined because of a sequencing fault).

The third line starts with a plus sign “+” and may optionally repeat the identifier line

elements or carry additional metadata.
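The rules for the second and third lines translate into simple checks. The sketch below is illustrative only: it accepts just the symbols A, C, G, T, and N named above, and the function names n_fraction and plus_line_repeats_id are hypothetical.

```python
VALID_BASES = set("ACGTN")


def n_fraction(sequence: str) -> float:
    """Return the fraction of ambiguous (N) calls in a read.

    Raises ValueError if the read contains a symbol other than A, C, G, T, or N.
    """
    sequence = sequence.upper()
    unexpected = set(sequence) - VALID_BASES
    if unexpected:
        raise ValueError(f"unexpected symbols in sequence: {sorted(unexpected)}")
    if not sequence:
        return 0.0
    return sequence.count("N") / len(sequence)


def plus_line_repeats_id(identifier_line: str, plus_line: str) -> bool:
    """Return True if the text after '+' repeats the text after '@'.

    A bare '+' is valid too; it simply returns False here.
    """
    if not identifier_line.startswith("@") or not plus_line.startswith("+"):
        raise ValueError("expected '@' on line 1 and '+' on line 3")
    return plus_line[1:] != "" and plus_line[1:] == identifier_line[1:]
```

Reads with a high proportion of N calls are common candidates for filtering during quality control.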

TABLE 1.2 Phred Quality Score, Error Probability, and Base Call Accuracy

Phred Quality Score   Error Probability   Base Call Accuracy (%)   Interpretation
10                    0.1                 90                       1 error in 10 calls
20                    0.01                99                       1 error in 100 calls
30                    0.001               99.9                     1 error in 1,000 calls
40                    0.0001              99.99                    1 error in 10,000 calls
50                    0.00001             99.999                   1 error in 100,000 calls
60                    0.000001            99.9999                  1 error in 1,000,000 calls
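
The values in Table 1.2 follow from the Phred relation Q = -10 log10(P), where P is the probability that a base call is wrong. The short Python sketch below converts between the two quantities and derives the base call accuracy. The function names are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred quality score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score Q for an error probability P: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


def base_call_accuracy(q: float) -> float:
    """Base call accuracy in percent: 100 * (1 - P)."""
    return 100 * (1 - phred_to_error_prob(q))


if __name__ == "__main__":
    # Print one line per Q score, mirroring the rows of Table 1.2.
    for q in (10, 20, 30, 40, 50, 60):
        print(q, phred_to_error_prob(q), base_call_accuracy(q))
```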